{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PFRL Quickstart Guide\n",
    "\n",
    "This is a quickstart guide for users who just want to try PFRL for the first time.\n",
    "\n",
    "If you have not yet installed PFRL, run the command below to install it:\n",
    "```\n",
    "pip install pfrl\n",
    "```\n",
    "\n",
    "If you have already installed PFRL, let's begin!\n",
    "\n",
    "First, you need to import necessary modules. The module name of PFRL is `pfrl`. Let's import `torch`, `gym`, and `numpy` as well since they are used later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pfrl\n",
    "import torch\n",
    "import torch.nn\n",
    "import gym\n",
    "import numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PFRL can be used for any problems if they are modeled as \"environments\". [OpenAI Gym](https://github.com/openai/gym) provides various kinds of benchmark environments and defines the common interface among them. PFRL uses a subset of the interface. Specifically, an environment must define its observation space and action space and have at least two methods: `reset` and `step`.\n",
    "\n",
    "- `env.reset` will reset the environment to the initial state and return the initial observation.\n",
    "- `env.step` will execute a given action, move to the next state and return four values:\n",
    "  - a next observation\n",
    "  - a scalar reward\n",
    "  - a boolean value indicating whether the current state is terminal or not\n",
    "  - additional information\n",
    "- `env.render` will render the current state. (optional)\n",
    "\n",
    "Let's try `CartPole-v0`, which is a classic control problem. You can see below that its observation space consists of four real numbers while its action space consists of two discrete actions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "observation space: Box(4,)\n",
      "action space: Discrete(2)\n",
      "initial observation: [ 0.03923832  0.00510645 -0.03804804  0.00186333]\n",
      "next observation: [ 0.03934045  0.20075283 -0.03801078 -0.30257726]\n",
      "reward: 1.0\n",
      "done: False\n",
      "info: {}\n"
     ]
    }
   ],
   "source": [
    "env = gym.make('CartPole-v0')\n",
    "print('observation space:', env.observation_space)\n",
    "print('action space:', env.action_space)\n",
    "\n",
    "obs = env.reset()\n",
    "print('initial observation:', obs)\n",
    "\n",
    "action = env.action_space.sample()\n",
    "obs, r, done, info = env.step(action)\n",
    "print('next observation:', obs)\n",
    "print('reward:', r)\n",
    "print('done:', done)\n",
    "print('info:', info)\n",
    "\n",
    "# Uncomment to open a GUI window rendering the current state of the environment\n",
    "# env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.\n",
    "\n",
    "PFRL provides various agents, each of which implements a deep reinforcement learning algorithm.\n",
    "\n",
    "Let's try using the DoubleDQN algorithm (https://arxiv.org/abs/1509.06461), which is implemented by `pfrl.agents.DoubleDQN`. This algorithm trains a Q-function that receives an observation and returns an expected future return for each action the agent can take. In PFRL, you can define your Q-function as `torch.nn.Module` as below. Note that the outputs are wrapped by `pfrl.action_value.DiscreteActionValue`. By wrapping the outputs of Q-functions, PFRL can support not only discrete-action Q-functions like this but also continuous-action Q-functions (via [Normalized Advantage Functions](https://arxiv.org/abs/1603.00748)) in the same way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "class QFunction(torch.nn.Module):\n",
    "\n",
    "    def __init__(self, obs_size, n_actions):\n",
    "        super().__init__()\n",
    "        self.l1 = torch.nn.Linear(obs_size, 50)\n",
    "        self.l2 = torch.nn.Linear(50, 50)\n",
    "        self.l3 = torch.nn.Linear(50, n_actions)\n",
    "\n",
    "    def forward(self, x):\n",
    "        h = x\n",
    "        h = torch.nn.functional.relu(self.l1(h))\n",
    "        h = torch.nn.functional.relu(self.l2(h))\n",
    "        h = self.l3(h)\n",
    "        return pfrl.action_value.DiscreteActionValue(h)\n",
    "\n",
    "obs_size = env.observation_space.low.size\n",
    "n_actions = env.action_space.n\n",
    "q_func = QFunction(obs_size, n_actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also possible to define the same model using `torch.nn.Sequential`. `pfrl.q_functions.DiscreteActionValueHead` is just a `torch.nn.Module` that packs its input to `pfrl.action_value.DiscreteActionValue`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "q_func2 = torch.nn.Sequential(\n",
    "    torch.nn.Linear(obs_size, 50),\n",
    "    torch.nn.ReLU(),\n",
    "    torch.nn.Linear(50, 50),\n",
    "    torch.nn.ReLU(),\n",
    "    torch.nn.Linear(50, n_actions),\n",
    "    pfrl.q_functions.DiscreteActionValueHead(),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual in PyTorch, `torch.optim.Optimizer` is used to optimize a model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# Use Adam to optimize q_func. eps=1e-2 is for stability.\n",
    "optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To create a DoubleDQN agent with these Q-function and optimizer, you need to specify a bit more parameters and configurations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# Set the discount factor that discounts future rewards.\n",
    "gamma = 0.9\n",
    "\n",
    "# Use epsilon-greedy for exploration\n",
    "explorer = pfrl.explorers.ConstantEpsilonGreedy(\n",
    "    epsilon=0.3, random_action_func=env.action_space.sample)\n",
    "\n",
    "# DQN uses Experience Replay.\n",
    "# Specify a replay buffer and its capacity.\n",
    "replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 6)\n",
    "\n",
    "# Since observations from CartPole-v0 is numpy.float64 while\n",
    "# As PyTorch only accepts numpy.float32 by default, specify\n",
    "# a converter as a feature extractor function phi.\n",
    "phi = lambda x: x.astype(numpy.float32, copy=False)\n",
    "\n",
    "# Set the device id to use GPU. To use CPU only, set it to -1.\n",
    "gpu = -1\n",
    "\n",
    "# Now create an agent that will interact with the environment.\n",
    "agent = pfrl.agents.DoubleDQN(\n",
    "    q_func,\n",
    "    optimizer,\n",
    "    replay_buffer,\n",
    "    gamma,\n",
    "    explorer,\n",
    "    replay_start_size=500,\n",
    "    update_interval=1,\n",
    "    target_update_interval=100,\n",
    "    phi=phi,\n",
    "    gpu=gpu,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have an agent and an environment. It's time to start reinforcement learning!\n",
    "\n",
    "During training, two methods of `agent` must be called: `agent.act` and `agent.observe`. `agent.act(obs)` takes the current observation as input and returns an exploratory action. Once the returned action is processed in the env, `agent.observe(obs, reward, done, reset)` then observes the consequences:\n",
    "- `obs`: next observation.\n",
    "- `reward`: an immediate reward.\n",
    "- `done`: a boolean value set to True if it reached a terminal state.\n",
    "- `reset`: a boolean value set to True if an episode is interrupted at a non-terminal state, typically by a time limit.\n",
    "\n",
    "Optionally, you can get training statistics of the agent via `agent.get_statistics`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "episode: 10 R: 12.0\n",
      "episode: 20 R: 10.0\n",
      "episode: 30 R: 9.0\n",
      "episode: 40 R: 12.0\n",
      "episode: 50 R: 10.0\n",
      "statistics: [('average_q', 0.86276), ('average_loss', 0.18341776728630066), ('cumulative_steps', 565), ('n_updates', 66), ('rlen', 565)]\n",
      "episode: 60 R: 10.0\n",
      "episode: 70 R: 14.0\n",
      "episode: 80 R: 16.0\n",
      "episode: 90 R: 10.0\n",
      "episode: 100 R: 10.0\n",
      "statistics: [('average_q', 5.4785624), ('average_loss', 0.23100754196755588), ('cumulative_steps', 1300), ('n_updates', 801), ('rlen', 1300)]\n",
      "episode: 110 R: 16.0\n",
      "episode: 120 R: 34.0\n",
      "episode: 130 R: 20.0\n",
      "episode: 140 R: 20.0\n",
      "episode: 150 R: 38.0\n",
      "statistics: [('average_q', 8.537258), ('average_loss', 0.29845759100979197), ('cumulative_steps', 2633), ('n_updates', 2134), ('rlen', 2633)]\n",
      "episode: 160 R: 65.0\n",
      "episode: 170 R: 144.0\n",
      "episode: 180 R: 200.0\n",
      "episode: 190 R: 200.0\n",
      "episode: 200 R: 200.0\n",
      "statistics: [('average_q', 10.152343), ('average_loss', 0.10948933256324381), ('cumulative_steps', 9775), ('n_updates', 9276), ('rlen', 9775)]\n",
      "episode: 210 R: 200.0\n",
      "episode: 220 R: 200.0\n",
      "episode: 230 R: 186.0\n",
      "episode: 240 R: 200.0\n",
      "episode: 250 R: 200.0\n",
      "statistics: [('average_q', 10.212633), ('average_loss', 0.06329123534378596), ('cumulative_steps', 18559), ('n_updates', 18060), ('rlen', 18559)]\n",
      "episode: 260 R: 200.0\n",
      "episode: 270 R: 200.0\n",
      "episode: 280 R: 76.0\n",
      "episode: 290 R: 27.0\n",
      "episode: 300 R: 200.0\n",
      "statistics: [('average_q', 9.8198185), ('average_loss', 0.07038218982575926), ('cumulative_steps', 26390), ('n_updates', 25891), ('rlen', 26390)]\n",
      "Finished.\n"
     ]
    }
   ],
   "source": [
    "n_episodes = 300\n",
    "max_episode_len = 200\n",
    "for i in range(1, n_episodes + 1):\n",
    "    obs = env.reset()\n",
    "    R = 0  # return (sum of rewards)\n",
    "    t = 0  # time step\n",
    "    while True:\n",
    "        # Uncomment to watch the behavior in a GUI window\n",
    "        # env.render()\n",
    "        action = agent.act(obs)\n",
    "        obs, reward, done, _ = env.step(action)\n",
    "        R += reward\n",
    "        t += 1\n",
    "        reset = t == max_episode_len\n",
    "        agent.observe(obs, reward, done, reset)\n",
    "        if done or reset:\n",
    "            break\n",
    "    if i % 10 == 0:\n",
    "        print('episode:', i, 'R:', R)\n",
    "    if i % 50 == 0:\n",
    "        print('statistics:', agent.get_statistics())\n",
    "print('Finished.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you finished training the DoubleDQN agent for 300 episodes. How good is the agent now? You can evaluate it by using `with agent.eval_mode()`. Exploration such as epsilon-greedy is not used anymore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "evaluation episode: 0 R: 200.0\n",
      "evaluation episode: 1 R: 200.0\n",
      "evaluation episode: 2 R: 171.0\n",
      "evaluation episode: 3 R: 173.0\n",
      "evaluation episode: 4 R: 200.0\n",
      "evaluation episode: 5 R: 200.0\n",
      "evaluation episode: 6 R: 200.0\n",
      "evaluation episode: 7 R: 198.0\n",
      "evaluation episode: 8 R: 200.0\n",
      "evaluation episode: 9 R: 200.0\n"
     ]
    }
   ],
   "source": [
    "with agent.eval_mode():\n",
    "    for i in range(10):\n",
    "        obs = env.reset()\n",
    "        R = 0\n",
    "        t = 0\n",
    "        while True:\n",
    "            # Uncomment to watch the behavior in a GUI window\n",
    "            # env.render()\n",
    "            action = agent.act(obs)\n",
    "            obs, r, done, _ = env.step(action)\n",
    "            R += r\n",
    "            t += 1\n",
    "            reset = t == 200\n",
    "            agent.observe(obs, r, done, reset)\n",
    "            if done or reset:\n",
    "                break\n",
    "        print('evaluation episode:', i, 'R:', R)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For your information, `CartPole-v0`'s maximum achievable return is 200. If the agent could not achieve 200, it was unlucky! You can train the agent longer by running the training loop again.\n",
    "\n",
    "If the results are good enough, the only remaining task is to save the agent so that you can reuse it. What you need to do is to simply call `agent.save` to save the agent, then `agent.load` to load the saved agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "# Save an agent to the 'agent' directory\n",
    "agent.save('agent')\n",
    "\n",
    "# Uncomment to load an agent from the 'agent' directory\n",
    "# agent.load('agent')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "RL completed!\n",
    "\n",
    "But writing code like this every time you use RL might be tedious. So, PFRL has utility functions that do these things."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "outdir:result step:180 episode:0 R:180.0\n",
      "statistics:[('average_q', 9.803836), ('average_loss', 0.05234420951426728), ('cumulative_steps', 26570), ('n_updates', 26071), ('rlen', 26570)]\n",
      "outdir:result step:344 episode:1 R:164.0\n",
      "statistics:[('average_q', 9.851653), ('average_loss', 0.07100017969845794), ('cumulative_steps', 26734), ('n_updates', 26235), ('rlen', 26734)]\n",
      "outdir:result step:474 episode:2 R:130.0\n",
      "statistics:[('average_q', 9.863388), ('average_loss', 0.05790386739885434), ('cumulative_steps', 26864), ('n_updates', 26365), ('rlen', 26864)]\n",
      "outdir:result step:584 episode:3 R:110.0\n",
      "statistics:[('average_q', 9.890334), ('average_loss', 0.05864107925037388), ('cumulative_steps', 26974), ('n_updates', 26475), ('rlen', 26974)]\n",
      "outdir:result step:712 episode:4 R:128.0\n",
      "statistics:[('average_q', 9.924887), ('average_loss', 0.07561466434504836), ('cumulative_steps', 27102), ('n_updates', 26603), ('rlen', 27102)]\n",
      "outdir:result step:878 episode:5 R:166.0\n",
      "statistics:[('average_q', 9.868885), ('average_loss', 0.05455060393956956), ('cumulative_steps', 27268), ('n_updates', 26769), ('rlen', 27268)]\n",
      "outdir:result step:947 episode:6 R:69.0\n",
      "statistics:[('average_q', 9.916588), ('average_loss', 0.07421551645325962), ('cumulative_steps', 27337), ('n_updates', 26838), ('rlen', 27337)]\n",
      "outdir:result step:1083 episode:7 R:136.0\n",
      "statistics:[('average_q', 9.856746), ('average_loss', 0.06856428321392741), ('cumulative_steps', 27473), ('n_updates', 26974), ('rlen', 27473)]\n",
      "evaluation episode 0 length:200 R:200.0\n",
      "evaluation episode 1 length:200 R:200.0\n",
      "evaluation episode 2 length:200 R:200.0\n",
      "evaluation episode 3 length:200 R:200.0\n",
      "evaluation episode 4 length:200 R:200.0\n",
      "evaluation episode 5 length:200 R:200.0\n",
      "evaluation episode 6 length:200 R:200.0\n",
      "evaluation episode 7 length:200 R:200.0\n",
      "evaluation episode 8 length:200 R:200.0\n",
      "evaluation episode 9 length:200 R:200.0\n",
      "The best score is updated -3.4028235e+38 -> 200.0\n",
      "Saved the agent to result/best\n",
      "outdir:result step:1142 episode:8 R:59.0\n",
      "statistics:[('average_q', 9.853057), ('average_loss', 0.05662544719700236), ('cumulative_steps', 27532), ('n_updates', 27033), ('rlen', 27532)]\n",
      "outdir:result step:1158 episode:9 R:16.0\n",
      "statistics:[('average_q', 9.872034), ('average_loss', 0.05622985374648124), ('cumulative_steps', 27548), ('n_updates', 27049), ('rlen', 27548)]\n",
      "outdir:result step:1237 episode:10 R:79.0\n",
      "statistics:[('average_q', 9.806777), ('average_loss', 0.07386907671520021), ('cumulative_steps', 27627), ('n_updates', 27128), ('rlen', 27627)]\n",
      "outdir:result step:1333 episode:11 R:96.0\n",
      "statistics:[('average_q', 9.882676), ('average_loss', 0.0502405039104633), ('cumulative_steps', 27723), ('n_updates', 27224), ('rlen', 27723)]\n",
      "outdir:result step:1386 episode:12 R:53.0\n",
      "statistics:[('average_q', 9.844122), ('average_loss', 0.06962731331412214), ('cumulative_steps', 27776), ('n_updates', 27277), ('rlen', 27776)]\n",
      "outdir:result step:1475 episode:13 R:89.0\n",
      "statistics:[('average_q', 9.931256), ('average_loss', 0.05207158264005557), ('cumulative_steps', 27865), ('n_updates', 27366), ('rlen', 27865)]\n",
      "outdir:result step:1512 episode:14 R:37.0\n",
      "statistics:[('average_q', 9.915121), ('average_loss', 0.05157772636157461), ('cumulative_steps', 27902), ('n_updates', 27403), ('rlen', 27902)]\n",
      "outdir:result step:1611 episode:15 R:99.0\n",
      "statistics:[('average_q', 9.915375), ('average_loss', 0.0663842154305894), ('cumulative_steps', 28001), ('n_updates', 27502), ('rlen', 28001)]\n",
      "outdir:result step:1761 episode:16 R:150.0\n",
      "statistics:[('average_q', 9.903722), ('average_loss', 0.06313523986464134), ('cumulative_steps', 28151), ('n_updates', 27652), ('rlen', 28151)]\n",
      "outdir:result step:1847 episode:17 R:86.0\n",
      "statistics:[('average_q', 9.898549), ('average_loss', 0.0762687680486124), ('cumulative_steps', 28237), ('n_updates', 27738), ('rlen', 28237)]\n",
      "outdir:result step:1929 episode:18 R:82.0\n",
      "statistics:[('average_q', 9.877551), ('average_loss', 0.0802266744733788), ('cumulative_steps', 28319), ('n_updates', 27820), ('rlen', 28319)]\n",
      "outdir:result step:2000 episode:19 R:71.0\n",
      "statistics:[('average_q', 9.86221), ('average_loss', 0.05153410392580554), ('cumulative_steps', 28390), ('n_updates', 27891), ('rlen', 28390)]\n",
      "evaluation episode 0 length:200 R:200.0\n",
      "evaluation episode 1 length:200 R:200.0\n",
      "evaluation episode 2 length:200 R:200.0\n",
      "evaluation episode 3 length:200 R:200.0\n",
      "evaluation episode 4 length:200 R:200.0\n",
      "evaluation episode 5 length:200 R:200.0\n",
      "evaluation episode 6 length:200 R:200.0\n",
      "evaluation episode 7 length:200 R:200.0\n",
      "evaluation episode 8 length:200 R:200.0\n",
      "evaluation episode 9 length:197 R:197.0\n",
      "Saved the agent to result/2000_finish\n"
     ]
    }
   ],
   "source": [
    "# Set up the logger to print info messages for understandability.\n",
    "import logging\n",
    "import sys\n",
    "logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')\n",
    "\n",
    "pfrl.experiments.train_agent_with_evaluation(\n",
    "    agent,\n",
    "    env,\n",
    "    steps=2000,           # Train the agent for 2000 steps\n",
    "    eval_n_steps=None,       # We evaluate for episodes, not time\n",
    "    eval_n_episodes=10,       # 10 episodes are sampled for each evaluation\n",
    "    train_max_episode_len=200,  # Maximum length of each episode\n",
    "    eval_interval=1000,   # Evaluate the agent after every 1000 steps\n",
    "    outdir='result',      # Save everything to 'result' directory\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's all of the PFRL quickstart guide. To know more about PFRL, please look into the `examples` directory and read and run the examples. Thank you!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
